27  Correlation

27.1 Karl Pearson’s Coefficient of Correlation

Karl Pearson’s correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC) or simply Pearson’s correlation, is a measure used in statistics to determine the degree of linear relationship between two variables. It’s widely used in the sciences to quantify the linear correlation between datasets.

27.1.1 Assumptions

Pearson’s correlation requires certain assumptions about the data it is used to analyze:

  1. Linearity: The relationship between the two variables should be linear.
  2. Homoscedasticity: The spread of the data around the line of best fit remains roughly constant as the value of the predictor variable increases.
  3. Normally Distributed Variables: Both variables being tested should follow a normal distribution.
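These assumptions are usually checked before computing \(r\). The sketch below shows one informal way to do so in Python, assuming NumPy, SciPy, and Matplotlib are available; the hours_studied and scores data simply reuse the exam example from later in this section.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Example data (same values as the worked example below)
hours_studied = np.array([2, 4, 6, 8, 10])
scores = np.array([20, 40, 60, 80, 100])

# 1. Linearity: the scatter plot should show points falling roughly along a straight line
plt.scatter(hours_studied, scores)
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.title("Visual check for linearity")
plt.show()

# 2. Homoscedasticity: the vertical spread of points around the line of best fit
#    should look roughly constant across the range of the predictor (judged visually here)

# 3. Normality: the Shapiro-Wilk test has the null hypothesis that a sample is normally distributed
print(stats.shapiro(hours_studied))
print(stats.shapiro(scores))

With only five observations these checks are rough at best; they are shown here only to illustrate the workflow.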

27.1.2 Formula

The Pearson correlation coefficient (\(r\)) is calculated using the following formula:

\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

Where:

  • \(n\) is the number of data points.
  • \(x\) and \(y\) are the variables for which the correlation is being calculated.
  • \(\sum\) represents the summation symbol, aggregating all values of \(x\), \(y\), \(xy\), \(x^2\), and \(y^2\).
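As a sketch of how this formula translates into code, the hypothetical helper below (pearson_r is not a library function) computes \(r\) directly from the raw sums using only the Python standard library.

import math

def pearson_r(x, y):
    """Compute Pearson's r from the raw-sum formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

For example, pearson_r([1, 2, 3], [2, 4, 6]) returns 1.0, matching the library functions used later in this section.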

27.1.3 Interpretation

The value of \(r\) ranges from -1 to +1:

  • +1 indicates a perfect positive linear relationship,
  • -1 indicates a perfect negative linear relationship,
  • 0 means no linear relationship exists.

Values close to +1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak one.
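To make the interpretation concrete, the short sketch below uses NumPy and three made-up toy datasets to produce coefficients near each end of the range.

import numpy as np

x = np.array([1, 2, 3, 4, 5])

y_pos = np.array([2, 4, 6, 8, 10])   # y increases with x  -> r = +1 (perfect positive)
y_neg = np.array([10, 8, 6, 4, 2])   # y decreases with x  -> r = -1 (perfect negative)
y_weak = np.array([2, 5, 1, 4, 3])   # no clear pattern    -> r = 0.1 (weak, near zero)

for label, y in [("positive", y_pos), ("negative", y_neg), ("weak", y_weak)]:
    print(label, np.corrcoef(x, y)[0, 1])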

27.1.4 Example Problem

Suppose we want to determine the relationship between hours studied and scores obtained in an exam. Here are the data for 5 students:

  • Hours Studied: 2, 4, 6, 8, 10
  • Scores: 20, 40, 60, 80, 100

Hypotheses:

  • Null Hypothesis (H₀): There is no linear correlation between hours studied and scores (\(r = 0\)).
  • Alternative Hypothesis (H₁): There is a linear correlation between hours studied and scores (\(r \neq 0\)).

Calculate Pearson’s r:

Using the data points provided, we first calculate the sums and products needed:

  1. Sum of Hours Studied (\(\sum x\)): \(2 + 4 + 6 + 8 + 10 = 30\)
  2. Sum of Scores (\(\sum y\)): \(20 + 40 + 60 + 80 + 100 = 300\)
  3. Sum of the product of hours and scores (\(\sum xy\)): \(2*20 + 4*40 + 6*60 + 8*80 + 10*100 = 2200\)
  4. Sum of the squares of hours (\(\sum x^2\)): \(2^2 + 4^2 + 6^2 + 8^2 + 10^2 = 220\)
  5. Sum of the squares of scores (\(\sum y^2\)): \(20^2 + 40^2 + 60^2 + 80^2 + 100^2 = 22000\)

Plugging these values into the formula gives: \[ r = \frac{5(2200) - (30)(300)}{\sqrt{[5(220) - (30)^2][5(22000) - (300)^2]}} = \frac{2000}{\sqrt{(200)(20000)}} = \frac{2000}{2000} = 1 \]
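As a quick sanity check on the arithmetic, the sketch below reproduces each sum and the final value of \(r\) in plain Python.

hours_studied = [2, 4, 6, 8, 10]
scores = [20, 40, 60, 80, 100]
n = len(hours_studied)

sum_x = sum(hours_studied)                                   # 30
sum_y = sum(scores)                                          # 300
sum_xy = sum(x * y for x, y in zip(hours_studied, scores))   # 2200
sum_x2 = sum(x ** 2 for x in hours_studied)                  # 220
sum_y2 = sum(y ** 2 for y in scores)                         # 22000

numerator = n * sum_xy - sum_x * sum_y                       # 5*2200 - 30*300 = 2000
denominator = ((n * sum_x2 - sum_x ** 2) *
               (n * sum_y2 - sum_y ** 2)) ** 0.5             # sqrt(200 * 20000) = 2000
print(numerator / denominator)                               # 1.0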

Conclusion:

Since \(r = 1\), there is a perfect positive linear relationship between hours studied and scores obtained, which supports the alternative hypothesis of a linear correlation between the two variables.
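The hypotheses stated above can also be tested formally. One way, sketched here under the assumption that SciPy is installed, is scipy.stats.pearsonr, which returns both \(r\) and a two-sided p-value for H₀: \(r = 0\).

from scipy import stats

hours_studied = [2, 4, 6, 8, 10]
scores = [20, 40, 60, 80, 100]

r, p_value = stats.pearsonr(hours_studied, scores)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
# Because the relationship here is perfectly linear, r = 1 and the p-value is
# (numerically) 0, so H0 is rejected at any conventional significance level.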

Pearson’s Correlation using Excel:

Download the Excel file here

Pearson’s Correlation using R:

# Data for calculation
hours_studied <- c(2, 4, 6, 8, 10)
scores <- c(20, 40, 60, 80, 100)

# Perform Pearson's correlation
cor_result <- cor(hours_studied, scores)

# Print the results
print(cor_result)
[1] 1

Pearson’s Correlation using Python:

import numpy as np

# Data for calculation
hours_studied = np.array([2, 4, 6, 8, 10])
scores = np.array([20, 40, 60, 80, 100])

# Perform Pearson's correlation
correlation_coefficient = np.corrcoef(hours_studied, scores)[0, 1]

# Print the results
print("Pearson's correlation coefficient:", correlation_coefficient)
Pearson's correlation coefficient: 1.0